
Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting

Neural Information Processing Systems

We study reinforcement learning in non-episodic factored Markov decision processes (FMDPs). We propose two near-optimal and oracle-efficient algorithms for FMDPs. Assuming oracle access to an FMDP planner, they enjoy a Bayesian and a frequentist regret bound respectively, both of which reduce to the near-optimal bound $O(DS\sqrt{AT})$ for standard non-factored MDPs. We propose a tighter connectivity measure, factored span, for FMDPs and prove a lower bound that depends on the factored span rather than the diameter $D$. In order to decrease the gap between lower and upper bounds, we propose an adaptation of the REGAL.C algorithm whose regret bound depends on the factored span. Our oracle-efficient algorithms outperform previously proposed near-optimal algorithms on computer network administration simulations.
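The objects the abstract refers to can be written out explicitly. Below is a hedged sketch of the standard FMDP transition decomposition and the non-episodic regret; the notation ($m$ factors, scopes $Z_i$, gain $\rho^*$) is assumed here, not copied from the paper:

```latex
% Hedged sketch: standard definitions assumed, not taken verbatim from the paper.
% Factored transition model: the next value of each state factor i depends only
% on a small parent scope Z_i of the current state:
P(s' \mid s, a) = \prod_{i=1}^{m} P_i\!\left(s'[i] \mid s[Z_i],\, a\right),
% Non-episodic regret against the optimal average reward (gain) \rho^* of M^*:
\mathrm{Reg}(T) = T\,\rho^{*}(M^{*}) - \sum_{t=1}^{T} r_t .
```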


Oracle-Efficient Regret Minimization in Factored MDPs with Unknown Structure

Neural Information Processing Systems

We study regret minimization in non-episodic factored Markov decision processes (FMDPs), where all existing algorithms make the strong assumption that the factored structure of the FMDP is known to the learner in advance. In this paper, we provide the first algorithm that learns the structure of the FMDP while minimizing the regret. Our algorithm is based on the optimism in the face of uncertainty principle, combined with a simple statistical method for structure learning, and can be implemented efficiently given oracle access to an FMDP planner. Moreover, we give a variant of our algorithm that remains efficient even when the oracle is limited to non-factored actions, which is the case with almost all existing approximate planners. Finally, we leverage our techniques to prove a novel lower bound for the known structure case, closing the gap to the regret bound of Chen et al. [2021].
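The "simple statistical method for structure learning" is not spelled out in the abstract; the toy sketch below is our own illustration of the general idea, not the paper's algorithm. It recovers the parent scope of one binary state factor by thresholding empirical mutual information between each candidate parent and the factor's next value; the helper `empirical_mi` and the threshold `0.05` are our assumptions.

```python
import numpy as np

def empirical_mi(x, y):
    """Empirical mutual information (in nats) between two discrete arrays."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))  # joint frequency
            px, py = np.mean(x == a), np.mean(y == b)  # marginals
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

rng = np.random.default_rng(0)
n = 5000
s1 = rng.integers(0, 2, n)  # state factor 1
s2 = rng.integers(0, 2, n)  # state factor 2 (irrelevant to factor 1's dynamics)
# True dynamics: factor 1's next value copies s1, flipped with probability 0.1.
s1_next = np.where(rng.random(n) < 0.9, s1, 1 - s1)

# Keep every candidate parent whose empirical MI with s1_next clears a threshold.
candidates = {"s1": s1, "s2": s2}
scope = [name for name, arr in candidates.items()
         if empirical_mi(arr, s1_next) > 0.05]
print(scope)  # -> ['s1']: the true parent is recovered, the spurious one rejected
```

In this toy setting the dependent pair has MI near $\ln 2 - H(0.1) \approx 0.37$ nats while the independent pair's MI vanishes as $O(1/n)$, so a fixed threshold separates them cleanly.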



details and add more discussions on related works in the camera-ready version

Neural Information Processing Systems

We thank all reviewers for their valuable comments. Entropy is used to measure sufficiency, compactness, and uniqueness. The usage of variance to approximate entropy was discussed in L203. Therefore, the performance deteriorates dramatically. We will run our algorithm in more environments and provide the results in the appendix.


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

- How do these compare to the regret bounds of the paper at hand?
- After the definition of regret, it is noted that the latter is random due to the randomness of M* (and the randomness of the algorithms and observations). It is not clear to me why M* is supposed to be random and not a fixed underlying MDP.
- In the definition of the factored MDPs, I did not understand the role of the set X. Does this correspond to a set of state-action pairs?
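On the second point, the randomness of M* is natural in the Bayesian-regret setting that the first paper above works in. A hedged sketch of the two standard notions (notation ours, not the paper's):

```latex
% Frequentist regret: M^* is a fixed, unknown MDP,
\mathrm{Reg}_T = T\,\rho^{*}(M^{*}) - \sum_{t=1}^{T} r_t ,
% Bayesian regret: M^* is drawn from a known prior \phi,
% and the regret is averaged over that draw,
\mathrm{BayesReg}_T = \mathbb{E}_{M^{*} \sim \phi}\!\left[\mathrm{Reg}_T\right].
```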